 similarity score



Reconciling Competing Sampling Strategies of Network Embedding

Neural Information Processing Systems

Network embedding plays a significant role in a variety of applications. To capture the topology of the network, most existing network embedding algorithms follow a sampling-based training procedure, which maximizes the similarity (e.g., the dot product of the embedding vectors) between positively sampled node pairs and minimizes the similarity between negatively sampled node pairs in the embedding space. Typically, close node pairs serve as positive samples, while distant node pairs are treated as negative samples. However, sampling strategies differ and can even compete: some methods champion sampling distant node pairs as positive samples to capture longer-distance information for link prediction, whereas others advocate adding close nodes to the negative sample set to boost the performance of node recommendation. In this paper, we seek to understand the intrinsic relationships between these competing strategies. To this end, we identify two properties (discrimination and monotonicity) that node embeddings should satisfy for any given node pair proximity distribution. Moreover, we quantify the empirical error of the trained similarity score w.r.t. the sampling strategy, which leads to an important finding: the discrimination property and the monotonicity property cannot be satisfied simultaneously for all node pairs in real-world applications. Guided by this analysis, we propose a simple yet novel model (SENSEI), which fulfills the discrimination property and partial monotonicity within the top-K ranking list. Extensive experiments show that SENSEI outperforms state-of-the-art methods in plain network embedding.
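The sampling training procedure described in this abstract can be illustrated with a minimal sketch. It assumes a standard negative-sampling objective with a log-sigmoid over dot products, which is a common choice but is not confirmed here as the paper's exact loss, and it is not the SENSEI model. The function names (`dot`, `log_sigmoid`, `sampling_loss`) and the toy embedding values are hypothetical.

```python
import math

def dot(u, v):
    # Similarity score: dot product of two embedding vectors.
    return sum(a * b for a, b in zip(u, v))

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sampling_loss(emb, positive_pairs, negative_pairs):
    # Assumed objective: the loss falls as positive pairs become more
    # similar and as negative pairs become less similar.
    loss = 0.0
    for u, v in positive_pairs:
        loss -= log_sigmoid(dot(emb[u], emb[v]))
    for u, v in negative_pairs:
        loss -= log_sigmoid(-dot(emb[u], emb[v]))
    return loss

# Hypothetical toy embeddings: nodes 0 and 1 are close, node 2 is distant.
emb = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [-1.0, 0.0]}

# Close pair as positive, distant pair as negative (the typical strategy).
typical = sampling_loss(emb, [(0, 1)], [(0, 2)])
# The same embeddings scored under the swapped sampling choice.
swapped = sampling_loss(emb, [(0, 2)], [(0, 1)])
```

With these embeddings, `typical` is smaller than `swapped`: the same vectors fit one sampling strategy well and the other poorly, which is the kind of tension between strategies that the abstract sets out to analyze.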


Supplementary for Mixed Supervised Object Detection by Transferring Mask Prior and Semantic Similarity

Neural Information Processing Systems

In this supplementary material, we provide further analyses of the mask prior in Section 1 and of similarity transfer in Section 2. We show visualization results in Section 3 and the variation in performance across iterations in Section 4. We also conduct experiments to mine base categories in the target dataset in Section 5. In addition, hyper-parameter analyses are provided in Section 6. Finally, we discuss the limitations in Section 7. As mentioned in Section 3.2 of the main paper, the mask prior provides coarse pixel-wise category information to improve the ability of the object detection network to locate and identify objects. Our ablation studies (Table 3 in the main paper) have already demonstrated the advantage of the mask prior. To further evaluate its effectiveness, we evaluate the object detection network with and without the mask generator on the VOC test set. Because the target dataset may contain both base categories and novel categories, of which only the novel categories have ground-truth bounding boxes, we evaluate our method on the novel categories.




Unsupervised Learning of Spoken Language with Visual Context

Neural Information Processing Systems

Humans learn to speak before they can read or write, so why can't computers do the same? In this paper, we present a deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images. We describe the collection of our data, which comprises over 120,000 spoken audio captions for the Places image dataset, and evaluate our model on an image search and annotation task. We also provide some visualizations that suggest that our model is learning to recognize meaningful words within the caption spectrograms.